Analysis of bibliometric indicators for individual scholars in a large data set

Citation numbers and other quantities derived from bibliographic databases are becoming standard
tools for the assessment of productivity and impact of research activities. Though widely used,
their statistical properties have not yet been well established. This is especially true in the
case of bibliometric indicators aimed at the evaluation of individual scholars, because large-scale
data sets are typically difficult to retrieve. Here, we take advantage of a recently introduced
large bibliographic data set, Google Scholar Citations, which collects the entire publication record of
individual scholars. We analyze the scientific profile of more than 30,000 researchers, and study the
relation between the h-index, the number of publications and the number of citations of individual
scientists. While the number of publications of a scientist has a rather weak relation with his/her
h-index, we find that the h-index of a scientist is strongly correlated with the number of citations
that she/he has received, so that the number of citations can be effectively used as a proxy of
the h-index. Allowing for the h-index to depend on both the number of citations and the number
of publications, we find only a minor improvement.



I. INTRODUCTION 

Bibliographic databases nowadays play a crucial role in modern science. Citation numbers, or
other measures derived from bibliographic data, are commonly used as quantitative indicators
of the impact of research activities. Citation analysis has been criticized [Tl [2^ ISH],
and the true meaning of a citation can differ greatly from context to context [S1[T7].
Despite these objections, the use of citations is widespread, and citation numbers are
frequently used for assessing the impact of individual scholars [12, 20], journals [15],
departments [in], universities and institutions [55]. Especially at the level of individual
scientists, numerical indicators based on citation counts are evaluation tools of fundamental
importance for decisions about hiring [4] and/or grant awards [5].

Though widely used, numerical indicators based on citation numbers are generally poorly
understood [28]. Even in the basic case of citation distributions of papers, where data are
easy to collect and analyze, there is no clear general picture. Depending on the study performed
and the data set analyzed, citation distributions have been judged compatible with several
possible statistical distributions: power-law functions |llM 41j, log-normal distributions
39j 47l |48], stretched exponentials [211 ES], and others. At the same time, although researchers
have not yet reached an agreement on the precise law governing citation distributions,
interesting properties of citation data are nevertheless detectable [55].
Up to now, large-scale statistical analyses have been limited to the study of citations
accumulated by papers [38, 49] or journals [43, 50]. In these cases data are easily collected
from the main bibliographic databases available on the market and do not require special
filtering procedures. Conversely, the collection of bibliographic data about individual
scholars is much more difficult. Simple searches on bibliographic databases are generally
unable to produce clean data sets because of evident problems related to the proper
disambiguation of scientists. All studies conducted so far have therefore been limited either
to small sets of scientists [21 [5J [HI HOI [321~ [33 [m [H], or to data sets subject to
disambiguation problems [7, 28, 40].

Here, we take advantage of a data set composed of more than 30,000 individual scientific
profiles, about two orders of magnitude larger than those used by Bar-Ilan [3], Bornmann
et al. [S], Costas and Bordons [9], Hirsch [20], Petersen et al. [32 ] [33 l [M ] 155],
Redner [35] and Schreiber et al. [H]. The data set is rather clean because profiles are
directly managed by the scientists themselves, who have an interest in providing correct
information about the outcome of their research activity. We perform an initial exploratory
analysis of this data set, and show that the main basic quantities used in research
evaluation exercises obey well-defined statistical distributions. We then use the data set
to investigate (on a scale more than 10 times larger than previous studies) the relation
between the h-index and other simple bibliometric indicators.



II. DATA SET 

We collected the scientific profiles of 89,786 scientists from the Google Scholar Citations
(GSC, scholar.google.com/citations) database. The profile of each scholar reports the entire
publication record of the scientist, including the year of publication and the number of
citations accumulated by each publication according to the Google Scholar database (for
studies about differences in the quantification of bibliometric indicators of individual
scientists between Google Scholar and other popular bibliographic databases see
[2, 13, 22, 23, 31]). The profile is owned and managed by the scientists themselves, who can
delete and add publications, and even merge two publications that the database initially
treated as different; it thus provides a clean source of information. Scientists are
requested to validate their profile by providing their academic email address. This
validation ensures that the profile is actually managed by the scientist. Finally, each
scientist is required to provide a set of keywords that identify the research fields in
which the scientist is active.

Data were collected between June 29 and July 4, 2012. The numbers of publications and
citations therefore reflect the research activity performed until that time. We used an
iterative procedure: we downloaded the entire set of authors associated with an initial
keyword (we used "network science"), added the other keywords used by these scientists,
downloaded the profiles of the new scientists using these new keywords, and so on, until
we could discover neither new scientists nor new keywords. In total, we identified
67,648 different keywords (see Fig. 1 for a word cloud of the most common keywords). It is
important to note that the database is evolving and growing rapidly. For example, we used
the same procedure to download data in March 2012, and at that time the data set was
composed of 49,365 scientists and 38,679 keywords.
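
The iterative collection procedure described above amounts to a breadth-first exploration of the keyword space. A minimal Python sketch follows; the callable `fetch_profiles_by_keyword` and the toy data are hypothetical stand-ins introduced only for illustration, and do not reproduce the actual download code or any Google Scholar interface.

```python
from collections import deque


def snowball_crawl(seed_keyword, fetch_profiles_by_keyword):
    """Collect all profiles and keywords reachable from a seed keyword.

    fetch_profiles_by_keyword is a hypothetical callable mapping a keyword to
    an iterable of (scientist_id, keywords) pairs; it stands in for the
    download step, which is not specified here.
    """
    seen_keywords = {seed_keyword}
    profiles = {}                       # scientist_id -> set of keywords
    queue = deque([seed_keyword])
    while queue:                        # stops when no new keyword is found
        keyword = queue.popleft()
        for scientist_id, keywords in fetch_profiles_by_keyword(keyword):
            if scientist_id not in profiles:
                profiles[scientist_id] = set(keywords)
            for kw in keywords:
                if kw not in seen_keywords:
                    seen_keywords.add(kw)
                    queue.append(kw)
    return profiles, seen_keywords


# Toy in-memory stand-in for the database (made-up names and keywords).
TOY_DB = {
    "alice": {"network science", "statistical physics"},
    "bob": {"statistical physics", "bibliometrics"},
    "carol": {"bibliometrics"},
    "dave": {"marine biology"},
}


def toy_fetch(keyword):
    """Return the toy profiles that list the given keyword."""
    return [(name, kws) for name, kws in TOY_DB.items() if keyword in kws]


profiles, keywords = snowball_crawl("network science", toy_fetch)
```

In the toy example the profile "dave" is never reached, because none of its keywords is connected to the seed. A procedure of this kind can only discover the part of the database linked to the initial keyword.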

In order to rely on the information provided by users, to have better control of the
publication record of individual scientists, and to include only scholars with a
sufficiently long period of activity and a sufficiently large number of publications, we
filter the data set with the following restrictions:

1. We restrict our analysis to the 83,897 scientists who validated their profile with an
academic email.

2. We delete from the data set publications that were published before 1945. This was
necessary in order to exclude papers whose year of publication is wrong in GSC. Note also
that scientists with a first publication before 1945, if still active and with a validated
profile, would have academic ages longer than 67 years.

3. We further restrict our attention to scientists with at least 20 publications and a
career length of at least 5 years (the academic age, or career length, is measured as the
difference between the year 2012 and the publication year of the scholar's first paper).

In the rest of the paper, we present the results of the analysis based on a total of
35,136 scholars whose research profile satisfies the aforementioned conditions.
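
The three restrictions can be summarized in a short Python sketch. The `Profile` structure and its field names are assumptions made for illustration. The sketch also assumes that the thresholds on the number of publications and on the career length are applied after the pre-1945 publications have been removed, which the text does not specify.

```python
from dataclasses import dataclass

REFERENCE_YEAR = 2012      # year used to measure the academic age
MIN_YEAR = 1945            # publications before this year are discarded
MIN_PUBLICATIONS = 20
MIN_CAREER_YEARS = 5


@dataclass
class Profile:
    """Hypothetical container for one profile."""
    verified_email: bool
    papers: list           # list of (publication_year, citations) pairs


def apply_filters(profile):
    """Return the cleaned list of papers, or None if the profile is excluded."""
    # Restriction 1: validated academic email.
    if not profile.verified_email:
        return None
    # Restriction 2: drop publications dated before 1945.
    papers = [(year, cites) for year, cites in profile.papers if year >= MIN_YEAR]
    # Restriction 3: at least 20 publications and a career of at least 5 years.
    if len(papers) < MIN_PUBLICATIONS:
        return None
    career_length = REFERENCE_YEAR - min(year for year, _ in papers)
    if career_length < MIN_CAREER_YEARS:
        return None
    return papers
```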



III. RESULTS 

A. General properties of the data set 

We first investigate general properties of the population in our data set. In Fig. 2A, we
show the composition of the population in terms of academic age. The probability density
function (pdf) $P(A)$ of the career length $A$ can be reasonably well described (by
graphical inspection, although the measured p-value does not support a good statistical
compatibility) by a log-normal distribution

$$P(A) = \frac{1}{A \, \sigma_A \sqrt{2\pi}} \, e^{-[\log(A) - \mu_A]^2 / (2 \sigma_A^2)} ,$$ (1)

and the best estimates of the parameters of the distribution (obtained with a least-squares
fit) are $\mu_A = 2.89$ and $\sigma_A = 0.51$ (the suffix $A$ indicates that the parameters
have been calculated for the academic age $A$).
The number of citations $C$ received by each scientist ($C$ is the sum of all citations
accumulated by all papers written by an author) is well fitted by (see Fig. 2B)

$$P(C) = \frac{1}{\tau_C \, C} \, e^{-(z + e^{-z})} ,$$ (2)

where $z = [\log(C) - \theta_C] / \tau_C$, and the best estimates of the parameters are
$\theta_C \simeq 6.42$ and $\tau_C \simeq 1.22$. Eq. (2) is a generalization to the
logarithms of the well-known Gumbel function that usually appears in the description of the
statistics of extreme values.
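
For concreteness, the two fitting functions can be evaluated numerically. The following is a sketch, assuming that the log-Gumbel density carries the Jacobian factor 1/(tau C) so that it is normalized as a density in C. The parameter names `theta` (location) and `tau` (scale), as well as the helper `integrate_on_log_grid`, are labels chosen here for illustration.

```python
import numpy as np


def lognormal_pdf(a, mu, sigma):
    """Log-normal density in the variable a."""
    a = np.asarray(a, dtype=float)
    return np.exp(-(np.log(a) - mu) ** 2 / (2 * sigma ** 2)) / (a * sigma * np.sqrt(2 * np.pi))


def log_gumbel_pdf(c, theta, tau):
    """Gumbel density in log(c), expressed as a density in c (assumed form)."""
    c = np.asarray(c, dtype=float)
    z = (np.log(c) - theta) / tau
    return np.exp(-(z + np.exp(-z))) / (tau * c)


def integrate_on_log_grid(pdf, log_lo, log_hi, n=20001):
    """Trapezoidal estimate of the integral of pdf(x) dx on a grid uniform in log(x)."""
    u = np.linspace(log_lo, log_hi, n)
    y = pdf(np.exp(u)) * np.exp(u)          # change of variable: dx = exp(u) du
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(u)))


# Parameter values quoted in the text for the academic age and the citations.
norm_a = integrate_on_log_grid(lambda a: lognormal_pdf(a, 2.89, 0.51),
                               2.89 - 10 * 0.51, 2.89 + 10 * 0.51)
norm_c = integrate_on_log_grid(lambda c: log_gumbel_pdf(c, 6.42, 1.22),
                               6.42 - 10 * 1.22, 6.42 + 40 * 1.22)

# Mode of a log-normal distribution: exp(mu - sigma^2).
mode_a = float(np.exp(2.89 - 0.51 ** 2))
```

Both integrals come out close to 1, as expected for normalized densities. The mode of the log-normal with the quoted parameters is close to 13.9 years, which is compatible with the peak near $A = 14$ reported for the academic age.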

In Figs. 2C and 2D, we report the pdfs of the number of publications $N$ and the h-index,
respectively. In these cases, we tried to fit the distributions with both Eqs. (1) and (2),
but neither was able to describe the empirical pdfs entirely. It is, however, interesting
to note that the pdf of the number of publications per author is neither a strictly
decreasing function nor a power-law function, as often assumed in the literature [13];
instead, the pdf in Fig. 2C shows a clear peak and a decay faster than a power law at large
values of $N$.
In general, the results presented in Fig. 2 depend on the choices we made in the selection
of the authors. For example, the peak position of $P(A)$ [Fig. 2A] moves from $A = 14$ to
$A = 6$ if we remove the restriction on the minimal number of publications needed to enter
the sample. Similar considerations are also valid for the other pdfs. On the other hand,
our choices do not affect the tails of the pdfs or, more generally, their shapes. For
example, even if the peak moves to lower values when we include all authors in the data
set, the pdf of Fig. 2A can still be reasonably well described by a log-normal distribution.

Figure 1: Word cloud of the most common keywords associated with the academic profiles in our data set.



B. Relation between the h-index and other indicators

In this subsection we empirically test some relations between the h-index and other
bibliometric indicators that have been proposed in the literature.
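
For reference, the three indicators compared in this subsection can be computed from the list of per-paper citation counts of a profile. The sketch below uses the standard definition of the h-index (the largest h such that h papers have at least h citations each); the function names and the example numbers are illustrative.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def indicators(citations):
    """Return the h-index, the number of papers N and the total citations C."""
    return {"h": h_index(citations), "N": len(citations), "C": sum(citations)}


example = indicators([10, 8, 5, 4, 3])   # made-up citation counts
```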

Already in his original paper, Hirsch himself presented a very simple model for the
accumulation of citations, which implies a correlation between the h-index and the total
number $C$ of citations received by an individual,

$$h \sim C^{1/\alpha_{h,C}} ,$$ (3)

with exponent $\alpha_{h,C} = 2$. A correlation of this type between $h$ and $C$ has been
verified empirically for small data sets [37, 42, 46]. In Fig. 3A we plot, for each author
in our data set, the h-index vs. the number of citations accrued $C$, in log-log scale. The
correlation is rather strong (linear correlation coefficient measured in log-log scale
$R_{h,C} = 0.95$), supporting the hypothesis of a scaling relationship between these two
quantities. Since we are interested in finding the scaling between $h$ and $C$ that would
provide the strongest relation between them, we determine the best estimate of
$\alpha_{h,C}$ as the one that produces the most localized distribution of the ratio
$x = C^{1/\alpha_{h,C}} / h$. We quantify the localization of the distribution of $x$ by
means of the so-called coefficient of variation $\sigma_x / \langle x \rangle$, i.e., the
ratio between the standard deviation and the average value of $x$ [8, 19]. The best
estimate of the power-law exponent in Eq. (3) is obtained as the value of $\alpha_{h,C}$
that minimizes the coefficient of variation. This way, we find
$1/\alpha_{h,C} = 1/2.39 = 0.42$ (see Fig. 3A), which is quite close to Hirsch's original
prediction. As additional fitting procedures, we also calculated the best estimate of
$\alpha_{h,C}$ as the one minimizing the kurtosis of the distribution of $x$, or the one
minimizing the mean square displacement
$\chi^2 = \sum_i ( h_i - a_{h,C} \, C_i^{1/\alpha_{h,C}} )^2$, where the sum runs over all
authors in the data set and $a_{h,C}$ is an additional fitting parameter. In all cases, we
find similar values for the best estimate of $\alpha_{h,C}$.
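
The estimation of the exponent by minimization of the coefficient of variation can be sketched as follows. The code scans a grid of candidate exponents on synthetic data generated here for illustration. The synthetic data, the grid and the noise level are assumptions and are not the values of the data set analyzed in the paper, and the function names are chosen here.

```python
import numpy as np


def coefficient_of_variation(x):
    """Ratio between the standard deviation and the mean of x."""
    return float(np.std(x) / np.mean(x))


def best_exponent(h, c, alphas):
    """Return the alpha on the grid that minimizes the CV of c**(1/alpha) / h."""
    h = np.asarray(h, dtype=float)
    c = np.asarray(c, dtype=float)
    cvs = np.array([coefficient_of_variation(c ** (1.0 / a) / h) for a in alphas])
    best = int(np.argmin(cvs))
    return float(alphas[best]), float(cvs[best])


# Synthetic test case: h = C**(1/alpha) times multiplicative noise.
rng = np.random.default_rng(0)
true_alpha = 2.4
c = np.exp(rng.normal(6.4, 1.2, size=5000))
h = c ** (1.0 / true_alpha) * np.exp(rng.normal(0.0, 0.15, size=5000))

alphas = np.linspace(1.5, 3.5, 201)
alpha_hat, cv_min = best_exponent(h, c, alphas)
```

On data of this kind the minimum of the coefficient of variation falls close to the exponent used to generate the sample, because any other exponent leaves a residual dependence of the ratio on $C$ that broadens its distribution.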

To further characterize the relation between $h$ and $C$, we study the statistical
properties of the quantity $C^{1/\alpha_{h,C}}/h$ in Fig. 3B. We find that the distribution
is narrowly peaked around a value close to 1, and that it can be nicely fitted by the
log-Gumbel distribution of Eq. (2), whose best parameter estimates are
$\theta_{h,C} \simeq 0.08$ and $\tau_{h,C} \simeq 0.14$.

The relation between $h$ and $C$ reported in Eq. (3) can be easily derived [T3] assuming
that the distribution of the number of citations accrued by the publications of a single
author is a power law $f(c) = K c^{-\alpha}$ ($K$ being a normalization constant). The
exponent $1/\alpha < 1/2$ found numerically implies a distribution $f(c)$ decaying with an
exponent slightly larger than 2. If the distribution is a perfect power law and
$\alpha > 2$, the same power-law relation also holds between $h$ and the total number of
publications $N$ miinj: $h \sim N^{1/\alpha}$. Since the distribution $f(c)$ is not a
perfect power law over the whole range of $c$ values, it is worth checking empirically the
validity of such a power-law relationship, by allowing the scaling exponent to be possibly
different from $1/\alpha_{h,C}$:

$$h \sim N^{1/\alpha_{h,N}} .$$ (4)
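
The scaling relations quoted above follow from a short calculation, sketched here under the assumptions of a continuous citation variable and of a pure power law valid for $c \ge c_{\min}$. The lower cutoff $c_{\min}$ and the continuum approximation are assumptions introduced for this sketch.

```latex
% Normalization of f(c) = K c^{-\alpha} on [c_min, infinity), which requires \alpha > 1
K = (\alpha - 1) \, c_{\min}^{\alpha - 1}

% h is the value at which the expected number of papers with at least h citations equals h
N \int_h^{\infty} K c^{-\alpha} \, dc = h
\;\Longrightarrow\;
\frac{N K}{\alpha - 1} \, h^{1 - \alpha} = h
\;\Longrightarrow\;
h = \left( N \, c_{\min}^{\alpha - 1} \right)^{1/\alpha} \sim N^{1/\alpha}

% For \alpha > 2 the mean number of citations per paper is finite,
% so the total number of citations grows linearly with N
\langle c \rangle = \int_{c_{\min}}^{\infty} c \, K c^{-\alpha} \, dc
                  = \frac{\alpha - 1}{\alpha - 2} \, c_{\min} ,
\qquad
C = N \langle c \rangle
\;\Longrightarrow\;
h \sim C^{1/\alpha}
```

The condition $\alpha > 2$ enters only through the finiteness of $\langle c \rangle$: it is what allows $C$ to be replaced by a quantity proportional to $N$, so that the same exponent $1/\alpha$ governs both relations.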



In Fig. 4A we find that $h$ and $N$ are correlated ($R_{h,N} = 0.72$), although less than
in the previous case. The best estimate for the exponent, again obtained by minimizing the
coefficient of variation, is $\alpha_{h,N} = 2.0$. Notice that this value indicates that
$h$ depends differently on $C$ and on $N$, thus contradicting the hypothesis that $f(c)$
is, for all scholars, a pure power law over its whole range. In Fig. 4B we plot the pdf of
the quantity $N^{0.50}/h$, which is approximately fitted by a log-normal distribution with
parameter values $\mu_{h,N} = -0.64$ and $\sigma_{h,N} = 0.39$. Also in this case we find
quite a narrow distribution, but the value of the coefficient of variation indicates a
worse agreement with the data than that found for $h$ vs. $C$. Indeed, the inset of
Fig. 3A shows that the minimum coefficient of variation (corresponding to
$\alpha_{h,C} = 2.39$) is around 0.19, while the minimum in the inset of Fig. 4A is
around 0.45.

Figure 2: A. Probability density function of the academic age of the scientists in the data set. Data are fitted with a log-normal
distribution [Eq. (1)] with parameter values $\mu_A = 2.89$ and $\sigma_A = 0.51$ (red dashed line). B. Probability density function of
the total number of citations received by the scientists in the data set. Data are fitted with a log-Gumbel distribution [Eq. (2)]
with parameter values $\theta_C \simeq 6.42$ and $\tau_C \simeq 1.22$ (red dashed line). C. Probability density function of the total number of
publications produced by the scientists in the data set. D. Probability density function of the h-index of the scientists in the
data set.

Under the power-law assumption for $f(c)$, it is also possible to express $h$ as a function
of both $N$ and the average number of citations per paper $\bar{x} = C/N$, obtaining [mils]

$$h \sim \bar{x}^{(\alpha_{h,C,N} - 1)/\alpha_{h,C,N}} \, N^{1/\alpha_{h,C,N}} .$$ (6)

Minimizing the coefficient of variation for the quantity
$C^{(\alpha_{h,C,N} - 1)/\alpha_{h,C,N}} \, N^{(2 - \alpha_{h,C,N})/\alpha_{h,C,N}} / h$,
we determine the best estimate of $\alpha_{h,C,N}$ as $\alpha_{h,C,N} = 1.70$ (inset of
Fig. 5A), implying $h \sim C^{0.41} N^{0.18}$. In the main panel of Fig. 5A we plot, for
each author, $h$ vs. $C^{0.41} N^{0.18}$, finding a good agreement with the expected linear
behavior. Also in this case we find that the quantities are correlated
($R_{h,C,N} = 0.94$). The pdf of the rescaled index $C^{0.41} N^{0.18}/h$ is well peaked
(Fig. 5B), and the distribution can be reasonably well fitted by a log-Gumbel distribution
with best-fit parameters $\theta_{h,C,N} \simeq 0.79$ and $\tau_{h,C,N} = 0.14$.
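
The exponents 0.41 and 0.18 quoted above follow from simple arithmetic on the best-fit value 1.70, after substituting the average number of citations per paper with C/N. A minimal Python check is given below, with function and variable names chosen here for illustration; prefactors are omitted, so only the scaling is reproduced.

```python
def combined_exponents(alpha):
    """Exponents of C and N in h ~ C**((alpha - 1)/alpha) * N**((2 - alpha)/alpha)."""
    return (alpha - 1.0) / alpha, (2.0 - alpha) / alpha


def scaling_from_mean(mean_citations, n_papers, alpha):
    """Scaling form written in terms of the mean citations per paper and N."""
    return mean_citations ** ((alpha - 1.0) / alpha) * n_papers ** (1.0 / alpha)


def scaling_from_totals(total_citations, n_papers, alpha):
    """Same scaling form written in terms of the total citations C and N."""
    exp_c, exp_n = combined_exponents(alpha)
    return total_citations ** exp_c * n_papers ** exp_n


exp_c, exp_n = combined_exponents(1.70)

# Made-up values (C = 1000, N = 50) used to check that the two forms coincide.
lhs = scaling_from_mean(1000.0 / 50.0, 50.0, 1.70)
rhs = scaling_from_totals(1000.0, 50.0, 1.70)
```

The two forms agree because $(C/N)^{(\alpha-1)/\alpha} N^{1/\alpha} = C^{(\alpha-1)/\alpha} N^{(2-\alpha)/\alpha}$.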

As the relation $h \sim C^{0.41} N^{0.18}$ shows, the dependence of $h$ on $C$ is much
stronger than the one on $N$. The comparison between the minimum coefficients of variation
measured for the three scaling assumptions [Eqs. (3), (4) and (6)] indicates that allowing
for a dependence on both $C$ and $N$ leads only to a marginal improvement over considering
only the dependence on $C$ (the coefficient of variation changes only from 0.19 to 0.18),
while a dependence only on $N$ performs definitely worse. The presence of a term dependent
on $N$ in Eq. (6) brings only a little improvement and leaves the exponent of the
dependence on $C$ practically unaltered (0.41 vs. 0.42).

Figure 3: A. Relation between the h-index and the number of citations $C$ for each scientist in our data set. We fit data with a
power-law function [Eq. (3)], whose best estimate of the exponent equals $1/\alpha_{h,C} \simeq 1/2.39 = 0.42$ (dashed line). B. Probability
density function of the quantity $C^{0.42}/h$. This function is fitted by a log-Gumbel distribution [Eq. (2)] with parameter values
$\theta_{h,C} \simeq 0.08$ and $\tau_{h,C} = 0.14$ (dashed line). The inset shows the same as the main plot but in a double-logarithmic scale.

Figure 4: A. Relation between the h-index and the number of publications $N$ for each scientist in our data set. We fit data
with a power-law function, whose best estimate of the exponent equals $1/\alpha_{h,N} = 0.50$ (dashed line). B. Probability density
function of the quantity $N^{0.50}/h$. This function is fitted by a log-normal distribution with parameter values $\mu_{h,N} \simeq -0.64$ and
$\sigma_{h,N} \simeq 0.39$ (dashed line). The inset shows the same as the main plot but in a linear-logarithmic scale.



IV. CONCLUSIONS 

Statistical analysis of bibliometric indicators devoted to the evaluation of individual
scholars is usually difficult because of the lack of large and clean data sets describing
accurately the publication records of researchers. This is a general problem that affects
every bibliographic database available on the market, and is also the main reason why the
studies performed so far on the characterization of the bibliographic profile of individual
scientists have rarely been based on more than 1,000 individuals. In recent years, however,
some of the main bibliographic databases have started to allow individual scientists to
freely manage their publication profiles with specially designed on-line tools. This is the
case of the recently created ResearcherID by Thomson Reuters (http://www.researcherid.com),
Mendeley profiles (http://www.mendeley.com) and also of Google Scholar Citations
(scholar.google.com/citations). In all these online administration systems, scholars
directly manage their profiles by adding, deleting and correcting their publication
records, and thus the information provided can be considered accurate because it is in the
interest of researchers to provide an accurate and up-to-date source of information
regarding their research production.
Here, we took advantage of the entire data set of Google Scholar Citations as of June 29,
2012. The data set is composed of more than 30,000 individual scholars working in research
institutions worldwide. Although Google Scholar Citations represents a relatively clean set
of data describing the academic records of individual researchers, it is important to
stress that our data do not represent a random sample of researchers, because the presence
of a researcher in the system is subject to various factors that, as a matter of fact, bias
the sample. First, our data set is mainly composed of relatively young scientists (Fig. 2A)
who are able to create a profile, validate it with their email, and manage it. Second, the
profiles that compose the data set are certainly those of scholars who want to promote
their research with the use of modern information-technology tools. Finally, although the
scientists present in our data set have various fields of expertise, some scientific
disciplines are clearly over-represented and others under-represented (see Fig. 1). We
would like to further emphasize that the entire data set analyzed here is clearly subject
to all the limitations of the Google Scholar database (possible presence of fake
publications, duplication of citations, etc.) that have been studied in depth in the
literature [IS l l24 l |26].

Figure 5: A. Relation between the h-index, the number of citations $C$ and the number of publications $N$ for each scientist in
$\tau_{h,C,N} \simeq 0.14$ (dashed line). The inset shows the same as the main plot but in a double-logarithmic scale.

Taking into account the aforementioned limitations of our data set, we provided an
exploratory analysis of some basic statistics for single authors and focused in some detail
on the relation between the h-index, the number of publications and the number of citations
of individual scientists. We found three main results:

1. The h-index $h$ is strongly correlated with the total number of citations $C$ received
by a scientist for his/her scientific production. In particular, we find
$h \sim C^{0.42}$, in qualitative agreement with the early hypothesis by Hirsch [20],
validated empirically on small data sets [37, 42, 46].

2. The h-index is also shown to be correlated with the number of publications $N$, but this
relation is much less precise than the one observed for $C$.

3. It is possible to combine both dependencies into a single power-law relation
$h \sim C^{0.41} N^{0.18}$. This law, however, provides only a slight improvement with
respect to the power-law relation that connects only $h$ and $C$.

Our results represent a large-scale validation of previously postulated conjectures. While
the exact values of the measured power-law exponents might be data-set dependent, we
believe that the main message has a validity that goes beyond the data analyzed here: the
total number of citations $C$ received by a scientist can be used as an effective proxy of
his/her h-index.

The fact that $h$ is strongly correlated with $C$ and much more weakly with the total
number of publications $N$ is evidence that the distribution $f(c)$ of citations accrued by
the publications of a single researcher is not a pure power law over its whole range.
Please note that we do not exclude the possibility that this fact is a consequence of the
sample used in our analysis, where scholars typically have a shorter academic age than the
average researcher, and thus individual citation distributions may not yet have reached a
sufficient level of stationarity. Our results, however, call for more work to better
characterize and understand the activity and citation profiles of individual scholars,
their common features and their variations. The large data set provided by Google Scholar
Citations constitutes an ideal tool for this endeavor. The present study represents only a
first attempt to scratch the surface of such a treasure trove.